Computer-Implemented System and Method for Efficient Voice Transcription

ABSTRACT

A computer-implemented system and method for efficient voice transcription is provided. A verbal message is processed by splitting the verbal message into segments and generating text for each of the segments via automated speech recognition. A confidence score is assigned to each text segment. The text segments are provided to workbenches, in order, starting with the text segment having a lowest confidence score. For at least one text segment provided to the workbench, one of edits to the text segment and manually transcribed text to replace the text segment are received. A threshold is applied to a time for performing the message processing and upon satisfaction of the threshold, the message processing is terminated. A text message is generated for the verbal message based on one of the generated text segment, manual transcription, or edited text segment for each of the text segments in that verbal message.

CROSS-REFERENCE TO RELATED APPLICATION

This non-provisional patent application is a continuation of U.S. patent application Ser. No. 14/074,653, filed on Nov. 7, 2013, pending, which is a continuation of U.S. Pat. No. 8,583,433, issued on Nov. 12, 2013, which is a continuation of U.S. Pat. No. 8,239,197, issued on Aug. 7, 2012, which is a continuation-in-part of U.S. Pat. No. 8,032,373, issued on Oct. 4, 2011, which is a divisional of U.S. Pat. No. 7,330,538, issued on Feb. 12, 2008, which was based on Provisional Patent Application Ser. No. 60/368,644, filed on Mar. 28, 2002, the priority dates of which are claimed and the disclosures of which are incorporated by reference.

FIELD

The invention relates in general to speech recognition and, specifically, to a computer-implemented system and method for efficient voice transcription.

BACKGROUND

People typically communicate with each other either verbally, e.g., in face-to-face conversations or via some form of telephone/radio; or, in written messages. Traditionally, written communications have been in the form of hand written or typed notes and letters. More recently, the Internet has made communication by chat and email messages a preferred form of communication.

Telephone systems are designed to convey audio signals that facilitate verbal communications. However, since the recipient of a telephone call is often not available to receive it, voice mail systems have been developed to record verbal messages so that they can be heard by the intended recipient at a later time. Periodically, the intended recipient can access their voice mail system via telephone or cell phone to hear the voice mail messages recorded from telephone calls that they missed receiving. However, a person may need to access several different voice mail accounts at different times during a day. For example, it is not unusual to have a voice mail account for a cell phone, another for a home phone, and yet another for an office phone.

For many people, it would be more convenient to receive all communications in text format rather than having to repeatedly access verbal messages stored in different locations. In regard to receiving the communications stored as verbal messages in multiple voice mail accounts, it would thus be easier for a person to receive emails or other forms of text messages that convey the content of the verbal messages, since it would then not be necessary for the person to call a voice mail account, and enter the appropriate codes and passwords to access the content of those accounts. Accordingly, it would be desirable to provide an efficient and at least semi-automated mechanism for transcribing verbal messages to text, so that the text can be provided to an intended recipient (or to a program or application programming interface (API) that uses the text). This procedure and system need not be limited only to transcribing voice mail messages, but could be applied for transcribing almost any form of verbal communication to a corresponding text. Ideally, the system should function so efficiently that the text message transcription is available for use within only a few minutes of the verbal message being submitted for transcription.

One approach that might be applied to solve this problem would use fully automated speech recognition (ASR) systems to process any voice or verbal message in order to produce corresponding text. However, even though the accuracy of an ASR program such as Nuance's Dragon Dictate™ program has dramatically improved compared to the earlier versions when trained to recognize the characteristics of a specific speaker's speech patterns, such programs still have a relatively high error rate when attempting to recognize speech produced by a person for which the system has not been trained. The accuracy is particularly poor when the speech is not clearly pronounced or if the speaker has a pronounced accent. Accordingly, it is currently generally not possible to solely rely on an automated speech recognition program to provide the transcription to solve the problem noted above.

Furthermore, if a service is employed to provide the transcription of verbal messages to text, the queuing of the verbal messages to be transcribed should be efficient and scalable so as to handle a varying demand for the service. The number of verbal messages that a service of this type would be required to transcribe is likely to vary considerably at different times of the day and during week days compared to weekends. This type of service can be relatively labor intensive since the transcription cannot be provided solely by automated computer programs. Accordingly, the system that provides this type of service must be capable of responding to varying demand levels in an effective and labor efficient manner. If overloaded with a higher demand for transcription than the number of transcribers then employed can provide, the system must provide some effective manner in which to balance quality and turnaround time to meet the high demand, so that the system does not completely fail or become unacceptably backlogged. Since a service that uses only manual transcription would be too slow and have too high a labor cost, it would be desirable to use both ASR and manual transcription, to ensure that the text produced is of acceptable quality, with minimal errors.

It has been recognized that specific portions of verbal messages tend to be easier to understand than other portions. For example, the initial part of a verbal message and the closing of the message are often spoken more rapidly than the main body of the message, since the user puts more thought into the composition of the main body of the message. Accordingly, ASR of the rapidly spoken opening and closing portions of a verbal message may result in higher errors in those parts of the message, but fewer errors than the main body of the verbal message. It would be desirable to use a system that takes such considerations into account when determining the portion of the message on which to apply manual editing or transcription, and to provide some automated approach for determining which portions of a message should be manually transcribed relative to those portions that might be acceptable if only automatically transcribed by an ASR program.

SUMMARY

In consideration of the preceding discussion, an exemplary method has been developed for transcribing verbal messages into text. This method includes the steps of receiving verbal messages over a network and queuing the verbal messages in a queue for processing into text. At least portions of successive verbal messages from the queue are automatically processed with online processors using an automated speech recognition program (ASR) to produce corresponding text. Whole verbal messages or segments of the verbal messages that have been automatically processed are assigned to selected workbench stations for further editing and transcription by operators using the workbench stations. The operators at the workbench stations to which the whole or the segments of the verbal messages have been assigned can listen to the verbal messages, correct errors in the text that was produced by the automatic processing, and transcribe portions of the verbal messages that have not been automatically processed by the ASR program. The resulting product comprises final text messages or segments of final text messages corresponding to the verbal messages that were in the queue. Segments of the text messages produced by the operators at the workbench stations are assembled from the segments of the verbal messages that were processed and, along with whole text messages corresponding to the whole verbal messages that were processed, are used to provide final output text messages.

The method further includes the step of validating a format of the verbal message and a return address that can be used for delivery of an output text message, before enabling queuing of each verbal message to be transcribed.

Verbal messages can be assigned to specific online processors in accord with predefined assignment rules, so that the online processor used is appropriate to automatically transcribe the type of verbal message assigned to it. Whole verbal messages can be simultaneously sent to the online processors for processing using the ASR program and to a queue for processing by an operator at one of the workbench stations.

Audio content in a verbal message can be separated from associated metadata. The associated metadata can include one or more elements such as proper nouns, and if the verbal message is a voice mail can include the caller's name, and the name of the person being called. Both the audio content and the metadata for the verbal messages in the queue can be input to the online processors for improving accuracy of the ASR program.

The step of automatically processing can include the steps of checking for common content patterns in the verbal messages to aid in automated speech recognition; and checking automatically recognized speech using a pattern matching technique to identify any common message formats.
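As a rough illustration of checking automatically recognized text for common message formats, the following minimal sketch uses regular expressions. It is not the disclosed pattern matching technique; the pattern list `COMMON_PATTERNS` and the function name `find_common_formats` are hypothetical, and the phrases are assumed examples.

```python
import re

# Hypothetical list of frequently used phrases; the actual patterns are not
# specified in the source.
COMMON_PATTERNS = [
    re.compile(r"^the following was left by a caller\b", re.IGNORECASE),
    re.compile(r"\bplease call me back\b", re.IGNORECASE),
]

def find_common_formats(text):
    """Return sorted (start, end) character spans that match a known pattern."""
    spans = []
    for pattern in COMMON_PATTERNS:
        spans.extend(match.span() for match in pattern.finditer(text))
    return sorted(spans)
```

Text inside the returned spans could be treated as likely correct, so that only the remaining text needs closer review.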

The method can further include the step of breaking up at least some of the verbal messages into the segments based on predefined rules. For example, the verbal message can be broken into the segments at points where silence is detected, such as between words or phrases, and the segments can be required to have a predefined maximum duration. Also, the segments can be selected so that they have between a predefined minimum and a predefined maximum number of words. Confidence ratings can be assigned to the segments of the verbal messages that were automatically recognized by the ASR program. Then, the verbal message, the automatically recognized text, a timeline for the verbal message, and the confidence ratings of the segments can be input to a workbench partial message queue. Furthermore, segments that have a confidence rating above a predefined level can be withheld from the workbench partial message queue, based on a high probability that the automatically recognized text is correct and does not require manual editing.

The step of assigning whole verbal messages or segments of verbal messages can include the steps of assigning the whole verbal messages or the segments of verbal messages to a specific workbench station used by an operator who is eligible to process verbal messages of that type. Also, segments of verbal messages having a lower quality can be assigned to workbench stations first, to ensure that such segments are transcribed with a highest quality, in a time allotted to process each of the verbal messages.

The operators at the workbench stations can edit and control transcription of the verbal messages in a browsing program display. Transcription of the whole verbal messages can be selectively carried out in one of three modes, including a word mode that includes keyboard inputs for specific transcription inputs, a line mode that facilitates looping through an audible portion of the verbal message to focus on a single line of transcribed text at a time, and a whole message mode, in which the operator working at the workbench station listens to the whole verbal message so that it can be transcribed to produce the corresponding text. Transcription of parts of a verbal message is carried out by an operator at a workbench station viewing a display of a graphical representation of an audio waveform for at least a part of the verbal message. A segment to be transcribed can be visually indicated in this displayed graphical representation.

The method can further include the step of applying post processing to text corresponding to the verbal messages that were transcribed, in order to correct minor errors in the text.

If it appears that editing the automatically produced text for a whole verbal message by an operator on a workbench station will exceed a required turn-around-time, the method can include the step of immediately post processing the automatically produced text without using any edits provided by any operator at a workbench station. Further, if it appears that editing parts of the verbal message will exceed the required turn-around-time, the method can include the step of post processing any text of the verbal message that was automatically recognized and has a confidence rating that is greater than a predefined minimum, along with any segments of the verbal message that have already been edited or transcribed by an operator on a workbench station, and any text of the verbal message that was automatically recognized and was moved into a workbench station queue but has not yet been edited by an operator at a workbench station.
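The partial-message fallback described above can be pictured with a minimal sketch. It assumes each segment is a dictionary with a `start` time, the ASR text, and an optional operator edit; the function name `assemble_on_deadline` and the field names are hypothetical and do not come from the source.

```python
def assemble_on_deadline(segments):
    """Build the text to post process once the turn-around-time is at risk.

    Each segment is a dict with 'start' (seconds), 'asr_text', and an
    optional 'edited_text' supplied by an operator. Edited text is used
    where it exists; otherwise the ASR text is used as-is.
    """
    parts = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        edited = seg.get("edited_text")
        parts.append(edited if edited is not None else seg["asr_text"])
    return " ".join(parts)
```

For a whole message, the same idea reduces to passing the automatically produced text straight to post processing.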

The step of producing final output text messages can include the steps of making the final output text messages available to an end user by transmitting the final output text messages to the end user in connection with an email message transmitted over the network, a short message service (SMS) message transmitted over the network and through a telephone system, a file transmitted over the network to a program interface, and a file transmitted over the network to a web portal.

The method can also include the step of employing edits made to text that was produced by the ASR program by operators at the workbench stations as feedback. This feedback will then be used to improve an accuracy of the ASR program.

Another aspect of the present novel approach is directed to a system for efficiently transcribing verbal messages that are provided to the system over a network, to produce corresponding text. The system includes a plurality of processors coupled to the network, for receiving and processing verbal messages to be transcribed to text. These processors implement functions that are generally consistent with the steps of the method discussed above.

A further embodiment provides a computer-implemented system and method for efficient verbal transcription. A verbal message is processed by splitting the verbal message into segments and generating text for each of the segments via automated speech recognition. A confidence score is assigned to each text segment. The text segments are provided to workbenches, in order, starting with the text segment having a lowest confidence score. For at least one text segment provided to the workbench, one of edits to the text segment and manually transcribed text to replace the text segment are received. A threshold is applied to a time for performing the message processing and upon satisfaction of the threshold, the message processing is terminated. A text message is generated for the verbal message based on one of the generated text segment, manual transcription, or edited text segment for each of the text segments in that verbal message.

This application specifically incorporates by reference the disclosures and drawings of each patent application and issued patent identified above as a related application.

This summary has been provided to introduce a few concepts in a simplified form that are further described in detail below in the Description. However, this Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects and attendant advantages of one or more exemplary embodiments and modifications thereto will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is a simplified block diagram showing exemplary elements of one application of a system in accord with the present approach, for efficiently transcribing verbal (voice mail) messages to text;

FIG. 2 is an overview functional flowchart showing exemplary steps for processing verbal messages and transcribing them to produce corresponding text messages;

FIG. 3 is a functional flowchart showing exemplary steps for processing inbound verbal messages;

FIG. 4 illustrates exemplary details of the whole message processor from FIG. 2;

FIG. 5 is a functional flowchart showing exemplary steps implemented by the split and merge processor of FIG. 2;

FIG. 6 is a functional flowchart showing exemplary steps carried out by the workbench scheduled assigner of FIG. 2;

FIG. 7 is a functional flowchart showing exemplary steps carried out by one of the workbenches of FIG. 2;

FIG. 8 is a functional flowchart showing further details of the final message finalization and delivery;

FIG. 9 is a functional flowchart showing further details of the quality feedback process;

FIG. 10 is a functional flowchart showing further details performed in connection with the SLA timers of FIG. 2; and

FIG. 11 is a schematic block diagram of an exemplary generally conventional computing device that is suitable for use in carrying out functions performed by various portions of the exemplary system described herein.

DETAILED DESCRIPTION

Figures and Disclosed Embodiments Are Not Limiting

Exemplary embodiments are illustrated in referenced Figures of the drawings. It is intended that the embodiments and Figures disclosed herein are to be considered illustrative rather than restrictive. No limitation on the scope of the technology and of the claims that follow is to be imputed to the examples shown in the drawings and discussed herein.

Overview of Exemplary Application for Transcription Service

FIG. 1 illustrates a functional block diagram 20 showing an exemplary application of the present novel approach for efficiently transcribing verbal messages to text. This exemplary application is directed to use of this technology for transcribing voice mail messages, although it is not intended that the technology be in any way limited to that specific application. In the simple illustration of FIG. 1, conventional telephones 22 and 24 are employed at different times by two parties attempting to place a telephone call to an intended recipient. In each case, the intended recipient is unavailable to take the call, either because the person is away from a telephone to which the calls are placed, or because the intended recipient is already talking on that telephone to a different party.

As is often the case, the intended recipient may actually have multiple voice mail systems to receive calls directed to different telephones; however, in this simple example, the intended recipient uses a single voice mail call center 26 to receive telephone calls that fail to reach that person when placed to one or more telephones used by the person. Furthermore, in this example, the intended recipient prefers to receive text transcriptions of any voice mail messages received by voice mail call center 26, which are recorded in a data store 28. To satisfy the requirement of this customer and others to receive corresponding text messages instead of checking one or more voice mail stores, the voice mail call center transmits the voice mail messages for the person to a service (as shown within the dash-line rectangle) that makes use of the present novel approach to produce corresponding text. The voice mail messages are input for automated speech recognition (ASR) processing, as indicated in a block 31, producing automatically recognized text corresponding to at least a portion of the voice mail messages submitted to the service for transcription. The voice mail messages and the text that has been automatically recognized are then provided to one or more workbench stations for additional processing by a human agent, in a block 33. The additional processing by human operators manning each workbench includes editing of the automatically recognized text, and/or further manual transcription of any portions of the voice mail messages that have not been automatically recognized during the ASR processing. The resulting text produced using the one or more workbench stations is stored in data storage 35 and then subsequently provided to the person who was the intended recipient of the voice mail messages that have been transcribed (or to a software program), as indicated in a block 37.

FIG. 2 is a functional block diagram 30 illustrating further details of an exemplary method and system in accord with the present novel approach for efficiently transcribing verbal messages to text and represents a general overview that starts with a new verbal message 32 being received for transcription into text by the system. New verbal message 32 includes both audio content in the form of an audio file, and metadata related to the verbal message. The metadata can include proper nouns associated with the message and, if the verbal message is a voice mail message, the metadata would include the name of the calling party, and the name of the person who was called. Also, since the text corresponding to the verbal message must be transmitted to an end user, each new verbal message that is received should include a callback network address or uniform resource locator (URL), to which the text should be directed after the verbal message has been transcribed by the service.

New verbal messages 32 are input to an inbound message processor 34, which validates each new verbal message, as described in greater detail below. After a verbal message is validated, it is input to a new message assignment processor 36, which assigns the verbal messages to specific online processors 38, based on a set of “assignment rules.” The servers will normally include one or more online processors that are used for the ASR processing.

The verbal messages are handled in two different ways to carry out the ASR processing. In some cases, whole verbal messages are processed by the ASR software program, producing automatically recognized text for the entire message. In other cases, the verbal message is split into parts, and only some of the parts may be automatically recognized by the ASR software program. The verbal messages that were input to the online processors and the automatically recognized text produced by the ASR software program are then output to a workbench scheduled assigner 46, which places these materials into a workbench queue.

The workbench queue provides input to one or more workbench stations 48 that are used by human agents. As noted above, these human agents review the automatically recognized text, editing it to correct errors, and also manually transcribe any portions of the verbal messages that were not automatically recognized. For those messages that were split into parts, portions of a message may be processed by a plurality of human agents at different workbenches, and the text produced by those agents is then reassembled to produce an overall text message corresponding to the original verbal message that was split into parts.

The output from the one or more workbench stations is input to a message finalization process 50. The finalization process corrects typographical and spelling errors in the text, producing output text that is input to a message delivery block 52, which prepares the text for delivery to an end user or software program that will use the text, as indicated in a message output block 54. In addition, message delivery block 52 also provides the original verbal message and all of the edits made by human agents manning the one or more workbenches as feedback to a quality feedback process 56 so that the ASR software program can improve its speech recognition accuracy to correct the errors noted by the human agents in the automatically recognized text previously produced by the ASR program.

The service providing the transcription of verbal messages to text may be required to commit to providing transcribed text for each verbal message received by the service within a specific time limit. If so, a service level agreement (SLA) might impose penalties (monetary) for delays in completing the transcription of verbal messages to text. Accordingly, FIG. 2 includes SLA timers 58, which are employed to determine if the service is meeting the transcription time limits agreed to in the contracts with parties subscribing to the service. Further details regarding SLA timers 58 and each of the other blocks shown in FIG. 2 are discussed below.

Further Details of the Exemplary Method and System

The functions carried out by inbound message processor 34 are illustrated in FIG. 3. When processing new verbal messages 32, inbound message processor 34 validates the audio format of the audio content portion of the verbal message in a decision step 60. If the format of the audio content is invalid, a step 62 provides for returning an error code, which then terminates further processing of that verbal message in a step 64. However, if the message format is valid, as indicated in a step 66, the audio content that has been extracted is stored in a networked storage 68. Also, the verbal message is queued in a client queue 70 to await further processing. Verbal messages are processed from client queue 70 at a step 72, which carries out new message assignment logic, checking the queue for new verbal messages, for example, every five seconds.
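The validation and queuing steps of FIG. 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `VerbalMessage`, `validate_and_queue`, and `SUPPORTED_FORMATS`, the error code string, and the set of accepted audio formats are all hypothetical. A plain dictionary stands in for networked storage 68 and a `deque` for client queue 70.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumed set of accepted audio formats; the source does not list them.
SUPPORTED_FORMATS = {"wav", "mp3"}

@dataclass
class VerbalMessage:
    message_id: str
    audio_format: str
    audio: bytes
    callback_url: str
    metadata: dict = field(default_factory=dict)

def validate_and_queue(msg, storage, client_queue):
    """Return None on success, or a hypothetical error code on failure."""
    if msg.audio_format.lower() not in SUPPORTED_FORMATS:
        # Invalid format: report an error and stop processing this message.
        return "ERR_INVALID_AUDIO_FORMAT"
    storage[msg.message_id] = msg.audio   # stands in for networked storage 68
    client_queue.append(msg.message_id)   # stands in for client queue 70
    return None
```

A separate polling loop (for example, one that wakes every five seconds) would then drain the queue and hand each message to the assignment logic.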

The new message assignment logic assigns verbal messages to the online processors based on a predefined set of assignment rules. For example, the assignment rules can select an online processor for processing a verbal message based upon the type of content, e.g., voice mail, a to do list, a conference call, etc., a priority level of the verbal messages, and other criteria, as appropriate.
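One way to picture such assignment rules is as an ordered list in which the first matching rule wins. The sketch below is only illustrative; the rule contents, the processor names, the priority scale, and the function `assign_processor` are assumptions and are not taken from the source.

```python
from typing import Callable, NamedTuple

class Rule(NamedTuple):
    matches: Callable[[dict], bool]  # predicate over message attributes
    processor: str                   # name of the online processor to use

# Hypothetical rules, evaluated in order; the first match wins.
ASSIGNMENT_RULES = [
    Rule(lambda m: m.get("priority", 0) >= 9, "priority-processor"),
    Rule(lambda m: m.get("content_type") == "voice_mail", "voicemail-processor"),
    Rule(lambda m: m.get("content_type") == "conference_call", "conference-processor"),
]

def assign_processor(message: dict, default: str = "general-processor") -> str:
    """Pick an online processor for a message described by a dict."""
    for rule in ASSIGNMENT_RULES:
        if rule.matches(message):
            return rule.processor
    return default
```

Keeping the rules in a data structure rather than in branching code makes it easy to add further criteria without touching the selection loop.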

FIG. 4 illustrates further details of the steps carried out by whole message processor 40, which implements one set of functions of online processors 38. The whole message processor sends a new verbal message, which includes the audio content and metadata, to ASR program 42. As noted above, the metadata includes proper nouns, and may include the caller and person being called in regard to voice mail messages. The metadata are used to improve the accuracy of the ASR process.

Simultaneously, whole message processor 40 sends the new verbal message to a workbench whole message input queue 80. As soon as the ASR process has completed automatic recognition of the verbal text, the results are linked to the verbal message within the workbench whole message input queue and together, the results and corresponding verbal message are made available to a workbench station used by an agent for processing the whole verbal message. It should be noted that a whole message may sometimes be assigned to an agent at a workbench station before the automatically recognized text from the ASR processing is available, to avoid delaying the processing of a verbal message. Workbench whole message queue 80 is made available to the workbench scheduled assigner to facilitate further manual processing, as discussed below.

Split and merge processor 44, which is included in online processors 38, sends the audio content from a verbal message to ASR 42 and also to a pattern matcher 90 (as shown in FIG. 5), which looks for patterns in the audio content. A decision step 92 determines if any common formats have been detected within the audio content portion of the verbal message, such as common patterns corresponding to frequently used phrases or sentences. For example, a verbal message might include the common phrase “[T]he following was left by a caller . . . ,” which would be more likely to be accurately recognized. In the event that a common message format is detected within the audio content, there is no need to send that portion of the audio content to the workbench for further manual processing by a human agent. Instead, that portion of the message is input to final message processing. However, the split and merge message processor sends other portions of a verbal message that do not match that of a common message format to a message analyzer 96. Similarly, ASR processing 42 produces automatically recognized text results that are input to a pattern matcher 94, which looks for commonly used text patterns. Message analyzer 96 breaks up the message into segments at points in the message where there is silence, or after a specified duration of time. This step also ensures that a minimum and maximum number of words are included per segment, in accord with predefined rules.
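The segmentation rule applied by the message analyzer can be illustrated with a minimal sketch that works on word timings. It assumes the ASR output provides a start and end time for each word; the `Word` class, the function `split_into_segments`, and all threshold values are hypothetical choices for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the message
    end: float

def split_into_segments(words, silence_gap=0.5, max_duration=10.0,
                        min_words=2, max_words=12):
    """Group words into segments, breaking at silence or when a limit is hit."""
    segments, current = [], []
    for word in words:
        if current:
            gap = word.start - current[-1].end
            duration = word.end - current[0].start
            # Break at silence only once the segment has enough words.
            at_silence = gap >= silence_gap and len(current) >= min_words
            # Break regardless if adding this word would exceed a maximum.
            too_long = duration > max_duration or len(current) >= max_words
            if at_silence or too_long:
                segments.append(current)
                current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments
```

In this sketch the maximum limits take precedence over the minimum word count, so a trailing segment may be shorter than `min_words`.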

Each word and fragment input to the message analyzer is assigned a confidence rating. Next, the message analyzer supplies: (a) the verbal message; (b) the automatically recognized text provided by the ASR process; (c) a timeline for processing the verbal message; and, (d) the confidence rating that was assigned to automatically recognized portions of the message, all to a workbench partial message queue 98. Segments that were automatically recognized by the ASR and have a confidence rating above a certain predefined level are withheld from the workbench partial message queue, as indicated in a step 100, since they do not require any additional processing by a human agent and can instead be output for final assembly into a text message corresponding to the verbal message from which the segments were derived. The segments that were input to workbench partial message queue 98 are now ready for assignment to a workbench station for further manual editing and/or transcription by a human agent.
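The withholding step can be expressed as a simple partition on the confidence rating. The following is a sketch under assumptions: segments are represented as `(segment_id, text, confidence)` tuples, confidence is on a 0 to 1 scale, and the function name `route_segments` and the default threshold are hypothetical.

```python
def route_segments(segments, threshold=0.9):
    """Partition (segment_id, text, confidence) tuples by confidence.

    Segments rated above the threshold skip the workbench partial message
    queue and go straight to final assembly; the rest are queued for a
    human agent.
    """
    to_workbench, to_final = [], []
    for seg in segments:
        if seg[2] > threshold:
            to_final.append(seg)
        else:
            to_workbench.append(seg)
    return to_workbench, to_final
```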

Further details relating to the functions carried out by workbench scheduled assigner 46 are illustrated in FIG. 6. Whole message queue 80 includes all of the messages that require editing and/or transcription by a human agent at a workbench station. Similarly, partial message queue 98 includes segments of messages that require editing and/or further processing by a human agent. At a predefined and configurable frequency, for example, every 15 seconds, workbench scheduled assigner 46 checks for new whole messages in whole message queue 80 and partial messages in partial message queue 98. Each message in these queues is assigned to a human agent according to a rank for the agent that is based on a quality, a fatigue factor, and a performance or speed. Quality, fatigue factor, and performance are attributes assigned to each transcription human agent, and are employed by the message assignment algorithm to determine which human agent will be assigned the verbal messages or parts of messages for editing/transcription. Quality is a measurement of the error rate of a human agent and is relatively static (since it isn't recalculated or updated frequently). Fatigue factor is a measure of the amount of idle time a human agent has between messages (i.e., a greater amount of idle time corresponds to a lower fatigue factor) and is typically recalculated many times during a single shift of each human agent. Performance measures the agent's work rate, e.g., a determination of the time required for a human agent to edit/transcribe a predefined number of seconds of verbal message audio data. It will be understood that these criteria are only exemplary and many other criteria might instead be used for determining the human agent that is assigned messages from the queues to edit/transcribe.
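The source does not state how the three attributes are combined into a rank, so the following is only one possible sketch: a weighted sum of costs, where lower is better. The `Agent` fields, the normalization of each attribute to a 0 to 1 range, the weights, and the function `rank_agents` are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    error_rate: float  # quality: 0.0 (no errors) to 1.0, lower is better
    fatigue: float     # 0.0 (long idle gaps) to 1.0 (little idle time)
    slowness: float    # performance: normalized work time, lower is faster

def rank_agents(agents, w_quality=0.5, w_fatigue=0.2, w_speed=0.3):
    """Return agents ordered best-first by an assumed weighted cost."""
    def cost(agent):
        return (w_quality * agent.error_rate
                + w_fatigue * agent.fatigue
                + w_speed * agent.slowness)
    return sorted(agents, key=cost)
```

Because the fatigue factor changes during a shift, a ranking of this kind would need to be recomputed each time the queues are checked.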

Not only is the ASR processing useful for assisting the human agents in transcribing verbal messages, and for dividing up the verbal message into partial sections, it is also used for deciding the assignment order of the partial sections for editing and transcription by the human agents. In carrying out this function, the ASR processing ensures that difficult sections (i.e., sections having a low machine confidence level in regard to accurate automated transcription) are assigned to the human agents before easy ones. In addition, high-performing human agents are preferably selected before slower or lower-quality human agents in editing and transcribing the more difficult portions of verbal messages. ASR processing also assists the system to perform well (although, perhaps with a higher error level) when the verbal message volume being submitted for transcription exceeds the capability of the available human agents to process. Thus, if there is a spike in verbal message transcription traffic, the system does not bog down and fail to meet its operating requirements due to a backlog of work that is increasing faster than the transcription service can process it. Instead, the more difficult portions of the verbal messages that have been automatically recognized, but have the lowest machine confidence levels are assigned out to human agents for editing and transcription and the remainder of the verbal messages will be completed using the text automatically recognized by the ASR processing, but in a gradual fashion. Accordingly, the higher the system load requirements for transcribing verbal messages, the higher will be the percentage of the text messages that is produced by ASR processing.
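The load-dependent behavior described above can be sketched by sorting sections from lowest to highest confidence and handing out only as many as the agents can handle. The notion of a numeric `agent_capacity`, the `(section_id, confidence)` tuple layout, and the function `plan_manual_work` are assumptions introduced for this illustration.

```python
def plan_manual_work(sections, agent_capacity):
    """Choose which sections go to human agents under limited capacity.

    sections: list of (section_id, confidence) tuples.
    agent_capacity: number of sections the agents can handle in time.
    Returns (manual_ids, asr_only_ids), hardest sections first.
    """
    capacity = max(0, agent_capacity)
    ordered = sorted(sections, key=lambda s: s[1])  # lowest confidence first
    manual = [sid for sid, _ in ordered[:capacity]]
    asr_only = [sid for sid, _ in ordered[capacity:]]
    return manual, asr_only
```

As capacity shrinks relative to the number of sections, a larger share of the output falls into the ASR-only list, which mirrors the gradual degradation the text describes.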

The workbench scheduled assigner determines how many human agents are online at the workbench stations. It should be understood that agents can use a workbench station from a remote location that is accessed over a network, e.g., the Internet, and these human agents may be located in many diverse geographic locations throughout the world. The human agent who carries out the editing and transcription of messages using a workbench station must have an excellent understanding of the language in which the verbal messages are spoken, but that language need not be the native language of the agent. Considerable labor savings can be achieved by using agents located in certain portions of the world in which labor rates are relatively low, without adversely affecting the quality of the editing and transcription of messages provided by such agents.

When determining which agents might be used for processing a whole or partial message, the workbench scheduled assigner determines the agents who are not already working on a message and the agents who are eligible to work on the type of content now available in each of the queues. The messages, partial or whole, are assigned to the human agents based on the message rank, agent availability, and whether a particular agent is eligible to receive a specific type of message content. For example, verbal messages of a technical nature should logically only be assigned to human agents who can understand a technical vocabulary. In making the assignment of partial or whole messages, workbench scheduled assigner 46 will generally assign message segments of lower quality to the agents first, to ensure that the output produced by the agent processing that message is of the highest quality, particularly given the constraints in the time applied to transcribing each message when SLA timers 58 (FIG. 2) are in use.
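
One possible reading of this assignment logic is sketched below as an example only. The dictionary keys (`confidence`, `content_type`, `busy`, `rank`, `eligible_types`) and the greedy pairing strategy are assumptions introduced for illustration; the description does not prescribe a particular data layout or matching algorithm.

```python
def assign(segments, agents):
    """Pair the lowest-confidence segments with the best idle, eligible agents.

    segments: list of dicts with "id", "confidence", "content_type" (assumed keys).
    agents: list of dicts with "name", "rank", "busy", "eligible_types" (assumed keys).
    Returns a dict mapping segment id to agent name; unassigned segments are omitted.
    """
    assignments = {}
    # Only agents not already working on a message are candidates.
    free = sorted((a for a in agents if not a["busy"]),
                  key=lambda a: a["rank"], reverse=True)
    for seg in sorted(segments, key=lambda s: s["confidence"]):
        for agent in free:
            if seg["content_type"] in agent["eligible_types"]:
                assignments[seg["id"]] = agent["name"]
                free.remove(agent)  # each agent takes one segment at a time
                break
    return assignments


if __name__ == "__main__":
    agents = [
        {"name": "ann", "rank": 0.9, "busy": False, "eligible_types": {"general"}},
        {"name": "bob", "rank": 0.7, "busy": False,
         "eligible_types": {"general", "technical"}},
    ]
    segments = [
        {"id": "s1", "confidence": 0.2, "content_type": "technical"},
        {"id": "s2", "confidence": 0.5, "content_type": "general"},
    ]
    print(assign(segments, agents))
```

In this sketch, a segment for which no idle and eligible agent exists simply stays in its queue until the next scheduled check.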

The functions implemented by a human agent using a workbench station are illustrated in FIG. 7. In a step 110, the arrival of a whole or partial message that has been assigned to a specific workbench for manual processing causes an audible sound to be produced, e.g., the sound of a chime, to warn that a new message or portion of the message has just arrived. The workbench station comprises a web browser-based software program that enables a human agent to edit and transcribe whole or partial messages. The human agent is able to operate on a message within an inbox and process an entire page of text without using mouse clicks, since the browser program comprising the workbench station employs keyboard shortcuts. In addition, the workbench station program includes audio controls for playing, pausing, rewinding, and fast forwarding through a verbal message while the human agent transcribes it to produce corresponding text.

One of three different modes of transcription can be selected for a whole message, including a word mode 116 that includes shortcuts on the keyboard for capitalization, number conversion, and alternate word choices; a line mode 114 that provides for looping through the audio, enabling an agent to focus on a single line of transcription at a time when producing corresponding text; and, a whole message mode 112. Thus, when a whole message is received, the workbench station can selectively be used in whole message mode 112, line mode 114, or word mode 116. If used in whole message mode 112, the workbench station program enables the human agent to edit or transcribe the entire message, producing corresponding text, which is then input to a proofread text step 122, producing output that is submitted for transmission to an end user (or an application program). If either line mode 114 or word mode 116 is selected by the human agent, the agent can process the line or word, editing it or transcribing it. A decision step 118 then determines if the end of the message has been reached. If not, a step 120 enables the human agent to edit or transcribe the next part of the whole message in either the line or word mode.

If a partial message is received for processing by the human agent at the workbench station, a step 126 provides for partial message transcription. In this case, the workbench station program displays a graphical representation of the audio waveform comprising the partial verbal message, in a step 128. In this graphical representation, the segment that is to be transcribed by the agent is highlighted. In addition, segments preceding and following the highlighted segment currently being transcribed are also displayed (when available), to provide context to the current segment. When processing automatically recognized text produced by the ASR program, as shown in a decision step 130, the human agent has the option of editing that text in a step 132, or replacing it completely with manually transcribed text that the agent believes to be more accurate, in a step 136. A decision step 134 determines if the partial message transcription is completed and if not, proceeds to the next part of the partial message in a step 138, returning again to graphical representation step 128. Once the partial message has been fully transcribed (or edited), the process again proceeds with step 122. It should be noted that proofreading of either a whole message or of a partial message that has been edited and/or transcribed is mandatory before the text that is produced is submitted for delivery to the end user in step 124. Submission of the text produced by the agent also then causes the workbench scheduled assigner to check for the next message that is to be processed by the agent on the workbench station. Further, the workbench station notifies the transcription server when a whole or partial message has been completely edited and/or transcribed.

There is a clear advantage to employing a plurality of different human agents working at different workbench stations to simultaneously edit and/or transcribe different segments of a message, since the processing of a verbal message can be completed much more rapidly with such parallel processing. Further, the portions or segments of a verbal message that have been assigned a lower confidence rating are processed first. Thus, if insufficient time is available (within the constraints imposed by the SLA timers) to complete the processing of a message using the workbench stations, the human agents will be employed for processing only the more difficult portions of the message. The overall quality of the message is thereby maintained once it is assembled from the segments that have been automatically recognized with a high confidence rating, but not processed by human agents, and those segments that have been processed by human agents.

FIG. 8 illustrates the steps of the procedure for message finalization and delivery. A partial message reassembler 150 receives automatically recognized text message segments produced by the ASR program and partial message segments that have been processed by one or more human agents. Entire messages are then reassembled from these segments, starting with the segments that were automatically recognized and were produced by the ASR program, and adding segments processed by one or more human agents at one or more workbench stations. Once the entire message has been reassembled in text form, post processing is applied to the whole text message by a message text post processor 152.
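
The reassembly step can be illustrated with the following minimal sketch, which assumes that each segment carries a position index within its message and that a human-processed segment supersedes the ASR text at the same position. The index-keyed dictionaries and the name `reassemble` are hypothetical.

```python
def reassemble(asr_segments, human_segments):
    """Rebuild a whole text message from its segments.

    asr_segments: dict mapping segment position -> ASR text (assumed layout).
    human_segments: dict mapping segment position -> edited or transcribed text.
    """
    merged = dict(asr_segments)    # start with the automatically recognized text
    merged.update(human_segments)  # add the segments processed by human agents
    return " ".join(merged[position] for position in sorted(merged))


if __name__ == "__main__":
    asr = {0: "hi john", 1: "call me at for", 2: "thanks"}
    human = {1: "call me at four"}
    print(reassemble(asr, human))
```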

Message text post processor 152 receives whole or partial messages produced by the audio content pattern matcher and the text pattern matcher, along with whole messages that have been edited and/or transcribed by a human agent using a workbench station. The post processing applied to reassembled messages and to whole messages includes the application of filters for checking formatting. For example, such filters ensure that the pronoun “I” is capitalized, and that the word “I'm” is properly capitalized and includes the apostrophe. In addition, post processing corrects commonly misspelled words and adds hyphens within the text, e.g., after pauses in the verbal message to improve readability.
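
A few of these formatting filters might be approximated, as an example only, with regular expressions. The misspelling table, the rule that a bare "im" becomes "I'm", and the function name `post_process` are assumptions for this sketch; hyphen insertion after pauses is not modeled because it would require timing data from the audio.

```python
import re

# Hypothetical table of common misspellings; a real table would be far larger.
COMMON_MISSPELLINGS = {"recieve": "receive", "tommorow": "tomorrow"}

_MISSPELLING_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, COMMON_MISSPELLINGS)) + r")\b")


def post_process(text):
    """Apply simple formatting filters to a reassembled text message."""
    # Restore the apostrophe and capital in a bare "im" or "Im" (assumed rule).
    text = re.sub(r"\b[iI]m\b", "I'm", text)
    # Capitalize a standalone lowercase "i", including the one in "i'm".
    text = re.sub(r"\bi\b", "I", text)
    # Replace known misspellings with their corrections.
    return _MISSPELLING_RE.sub(lambda m: COMMON_MISSPELLINGS[m.group(1)], text)


if __name__ == "__main__":
    print(post_process("i think im going to recieve it tommorow and i'm glad"))
```

A filter this simple would also capitalize the "i" in an abbreviation such as "i.e.", so a production filter would presumably need additional context checks.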

Following post processing, text messages are delivered to the network address specified when the verbal message was received by the service, such as an Internet URL. The text produced by transcribing the verbal message can be made available to an end user via a transmission in an e-mail, through a short message service (SMS) transmission, or supplied to an application program interface (API) as a callback. As a further alternative, the text can be added to a message store through a web portal specified by the URL or other network address that was included with the verbal message originally submitted for transcription.

It is generally recognized that the accuracy of an ASR program can be improved by providing quality feedback, which is the intention of quality feedback process 56, as illustrated in FIG. 9. In this process, a step 160 provides for sending all of the edits, along with the original automatically recognized text produced by the ASR program, back to the ASR service. A step 162 then batch processes this feedback information to identify sounds, words, and/or phrases that were edited or changed by the human agent in the automatically recognized text, so that these corrections can be employed to improve the accuracy of future speech recognition by the ASR engine. The result of the batch processing step is employed in a step 164 to update the acoustic models that are embedded in the ASR engine, thereby improving its accuracy. A further aspect of this process is implemented in a step 166, which provides for monitoring on a continuing basis the differences between the automatically recognized text and the text that is manually edited by a human agent, so that ASR quality can be continually tracked over time, to ensure that it is not degrading, but is instead, improving.
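
The word-level comparison underlying steps 162 and 166 can be illustrated with Python's standard `difflib` module. This is only a sketch of one way to find what an agent changed and to track agreement over time; the function names are hypothetical, and updating the acoustic models in step 164 is outside the scope of the example.

```python
import difflib


def corrections(asr_text, edited_text):
    """List (asr_words, edited_words) pairs wherever the agent changed the text."""
    asr_words, edited_words = asr_text.split(), edited_text.split()
    matcher = difflib.SequenceMatcher(a=asr_words, b=edited_words, autojunk=False)
    return [(" ".join(asr_words[i1:i2]), " ".join(edited_words[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def word_agreement(asr_text, edited_text):
    """Similarity ratio in [0, 1] between ASR words and edited words."""
    matcher = difflib.SequenceMatcher(
        a=asr_text.split(), b=edited_text.split(), autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    asr = "call me at for thirty"
    edited = "call me at four thirty"
    print(corrections(asr, edited))
    print(word_agreement(asr, edited))
```

Averaging `word_agreement` over each batch of messages would give one simple indicator of whether ASR quality is improving or degrading over time.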

It should be emphasized that while this exemplary embodiment includes SLA timers 58, these timers are only included to ensure that the processing of verbal messages is completed within time limits that are contractually established in agreements between the parties submitting verbal messages for transcription, and the service. Further details that are employed in connection with this requirement are illustrated in FIG. 10. As provided by a contractual agreement, each verbal message has a required turn-around-time (TAT) in which the transcription of a verbal message into text must be completed. Throughout the process for transcribing the verbal message, timestamps are saved to monitor the amount of time required by each step of the process. Various components of the process are provided with these timers to ensure that the overall TAT for a verbal message does not exceed the guaranteed SLA TAT.

If it appears that the processing of a whole message is going to cause the overall TAT for that verbal message to exceed the SLA TAT, the procedure calls for immediate post processing of automatically recognized text, which will then be transmitted to the end user. In this case, manual processing by a human agent at a workbench station is not applied to the automatically recognized text, but instead, the automatically recognized text is used as is. If it appears that the SLA TAT time is about to expire for a partial message, the text message that is post processed and transmitted to the end user will include: (a) any automatically recognized text message segments having a sufficiently high confidence rating; (b) segments of the message that have already been processed by a human agent at a workbench station; and, (c) any additional automatically recognized text produced by the ASR program, which has not yet been edited by a human agent at a workbench station. As noted above, segments of a verbal message are processed by human agents in order starting from those with the lowest quality to those with the highest quality, thereby ensuring that high-quality text is provided in the output text message. Any segments or whole messages remaining in a queue after the SLA timer has been processed for that message are removed from the queue.

In summary, a step 170 provides for monitoring the timers for each phase of the transcription process. A decision step 172 determines if further processing by a human agent at a workbench station will cause the TAT to exceed the SLA. If so, a step 174 ensures that the automatically recognized text produced by the ASR program will be employed without further human agent editing or transcription. Conversely, a negative result to decision step 172 leads to a step 176, which continues processing by a human agent using a workbench station.
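
Decision step 172 and its two outcomes can be expressed, by way of example, as a comparison between the time remaining under the SLA and an estimate of the time a human agent would still need. The `MessageTimer` class, the estimate parameter, and the returned labels are assumptions for this sketch; the description does not state how the remaining human effort is predicted.

```python
from dataclasses import dataclass


@dataclass
class MessageTimer:
    received_at: float  # timestamp (seconds) when the verbal message arrived
    sla_tat_s: float    # contractual turn-around-time in seconds

    def remaining(self, now):
        """Seconds left before the SLA TAT for this message is exceeded."""
        return self.sla_tat_s - (now - self.received_at)


def next_step(timer, now, estimated_human_s):
    """Mirror decision step 172: would more human processing exceed the SLA?"""
    if estimated_human_s > timer.remaining(now):
        return "use_asr_text"            # step 174: use ASR text without further editing
    return "continue_human_processing"   # step 176: keep working at the workbench


if __name__ == "__main__":
    timer = MessageTimer(received_at=0.0, sla_tat_s=300.0)
    print(next_step(timer, now=100.0, estimated_human_s=120.0))
    print(next_step(timer, now=250.0, estimated_human_s=120.0))
```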

Exemplary Computing Device for Use in the Present System

FIG. 11 illustrates an exemplary computing system 200 that is suitable for use as a computing device employed for implementing various aspects of the novel approach described above, i.e., for providing efficient transcription of verbal messages to text. Computing system 200 can be employed for carrying out the initial ASR function and for controlling the queuing of verbal messages and parts of messages provided to each workbench, and then reassembling the text portions of the messages produced at a plurality of workstations to produce the output text messages. It will be appreciated that the present approach is very scalable to meet the demand for transcribing verbal messages. For implementing a transcription service that is national or even international in scope, which is certainly reasonable using the data communication capabilities of the Internet, a plurality of computing systems 200 will likely be employed for an exemplary system as described above, and these may be disposed at geographically disparate locations, for example, based upon the cost of providing the specific functions at a location or its proximity to the location of the demand for the transcription services.

It is emphasized that computing system 200 is exemplary and that some of the components described below may not be required or even used in connection with the functions that the computing system provides in the transcription system. In this example, computing system 200 includes a processor 212 that is coupled in communication with a generally conventional data bus 214. Also coupled to the data bus is a memory 216 that includes both random access memory (RAM) and read only memory (ROM). Machine instructions are loaded into memory 216 from storage on a hard drive 218 or from other suitable non-volatile memory, such as an optical disk or other optical or magnetic media. These machine instructions, when executed by processor 212, can carry out a plurality of different functions employed to implement the approach as described herein, as well as other functions.

An input/output (I/O) interface 220 that includes a plurality of different types of ports, such as serial, parallel, universal serial bus, PS/2, and Firewire ports, is coupled to data bus 214 and is in turn connected to one or more input devices 224, such as a keyboard, mouse, or other pointing device, enabling a user to interact with the computing system and to provide input and control the operation of the computing system. A display interface 222 couples a display device 226 to the data bus, enabling a browser program window and other graphic and text information to be displayed for viewing by a user, e.g., if computing system 200 comprises a client computing device. The computing system is coupled to a network and/or to the Internet 230 (or other wide area network) via a network interface 228, which couples to data bus 214. Through the network interface, the computing system is able to access verbal messages that are stored on or provided by other computing devices or sites 232a-232n, wherein the subscript “n” on “other computing device 232n” can be a very large number, e.g., indicating that there are potentially many other computing devices in communication with computing system 200 over the Internet (or other network).

Although the concepts disclosed herein have been described in connection with the preferred form of practicing them and modifications thereto, those of ordinary skill in the art will understand that many other modifications can be made thereto within the scope of the claims that follow. Accordingly, it is not intended that the scope of these concepts in any way be limited by the above description, but instead be determined entirely by reference to the claims that follow.

What is claimed is:
 1. A computer-implemented method for efficient verbal transcription, comprising: processing a verbal message, comprising: splitting the verbal message into segments; generating text for each of the message segments via automated speech recognition; assigning a confidence score to the generated text of each segment; providing the text segments to one or more workbenches, in order, starting with the text segment having a lowest confidence score; and receiving for at least one text segment provided to the workbench, one of edits to the text segment and manually transcribed text to replace the text segment; applying a threshold to a time for performing the message processing; upon satisfaction of the threshold, terminating the message processing; and generating a text message for the verbal message based on one of the generated text segment, manual transcription, or edited text segment for each of the text segments in that verbal message.